1. Introduction

Let $(X, Y) \sim F_{XY}$ be an $\mathbb{R}^m \times \{1, \ldots, K\}$-valued random pair, where $X$ is
the feature vector and $Y$ is the class label. In statistical pattern recognition (see, e.g., [Hj, [7])
one seeks a classifier $g : \mathbb{R}^m \to \{1, \ldots, K\}$ such that the probability of
misclassification $L(g) = P\{g(X) \neq Y\}$ is acceptably small. In modern applications, however, the
feature vector $X$ is generally a random variable of high dimension $m$, and it is often beneficial to
carry out the classification in some lower dimension $d$ ($1 \le d < m$). Dimension reduction is
therefore often applied first to embed $X$ from $\mathbb{R}^m$ into $\mathbb{R}^d$, prior to subsequent
classification.

Herein we consider only linear projections, which are commonly used and are the foundation
for many nonlinear methods. We denote a linear projection $A : \mathbb{R}^m \to \mathbb{R}^d$ by an
$m \times d$ matrix $A$; then $A'X$ (the $'$ sign denotes transpose) is the projected feature vector in
$\mathbb{R}^d$. It follows that the classification error for a fixed classifier $g$ (whose domain is
$\mathbb{R}^d$ from now on) is given by $L_A = P\{g(A'X) \neq Y\}$.

Given a distribution $F_{XY}$, a classifier $g$, and a nonempty set of linear projections $\mathcal{A}$,
we define an optimal projection $A^* \in \arg\min_{A \in \mathcal{A}} \{L_A\}$ and denote the
corresponding minimum error by $L_{A^*}$. The set $\mathcal{A}$ and the existence of $A^*$ are discussed
in Section 2 and Assumption 1. Roughly speaking, $L_{A^*}$ is the minimum error one can hope to achieve
by choosing $A$ cleverly among linear projections.

Assuming the classifier $g$ is specified, the crucial problem is how to choose the dimension
reduction method. If we have only $X$ available as the feature vector, then PCA (Principal Component
Analysis) [1, is a natural choice, which is applied for classification in 15j. On the other hand,
if there is an auxiliary feature $Z_1$ of dimension $m_1$ available, that is, $(X, Z_1, Y) \sim F_{XZ_1Y}$
on $\mathbb{R}^m \times \mathbb{R}^{m_1} \times \{1, \ldots, K\}$, then CCA (Canonical Correlation
Analysis) [10] is applicable to the pair $(X, Z_1)$ to derive the projection $A$, which is used in [H].
In general, if there are $S$ auxiliary features $\{Z_s \in \mathbb{R}^{m_s}, s = 1, \ldots, S\}$ (we
always assume $1 \le d < \min\{m, m_1, \ldots, m_S\}$), then GCCA (Generalized CCA) [11] is applicable
to $(X, Z_1, \ldots, Z_S)$ to derive $A$ based on $X$ and the auxiliary features $\{Z_s\}$.

Note that our classification task remains the same, so that at the classification step we observe
only $X$ but not $\{Z_s\}$; and so by "GCCA/CCA is applicable" we mean "GCCA/CCA can be used
to derive the projection matrix $A$ for use in the classifier $g(A'X)$". Furthermore, although CCA
is a special case of GCCA, for clarity we shall assume GCCA uses at least two auxiliary
features whenever GCCA is compared to CCA.

In this paper we concentrate our theoretical analysis on GCCA/CCA, and derive conditions
implying the superiority of GCCA for classification purposes. Let us say the joint feature
$(X, Z_1, \ldots, Z_s) \sim F_{s+1}$, and a projection matrix $A$ derived from GCCA/CCA using $X$ and
$s$ auxiliary features is denoted by $A_{s+1}$. Our main objective is to derive sufficient conditions on
$F_3$ such that if $\max\{L_{A_2}\} = L_{A^*}$, then $L_{A_3} = L_{A^*}$, as well as sufficient conditions
such that $L_{A^*} = L_{A_3} < \min\{L_{A_2}\}$. (Note that when there are two auxiliary features, $A_2$
may come from applying CCA to either $(X, Z_1)$ or $(X, Z_2)$; hence the 'max' and 'min'.) The
conditions and necessary prerequisites are discussed in Section 2, and the theorems follow in Section 3.
Our theoretical results are illustrated via simulation, as well as a real data experiment on Wikipedia
documents, in Section 4. Our previous numerical work illustrating GCCA improvement is available in [12].

2. Preliminaries

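Given two auxiliary features $Z_1$ and $Z_2$, the joint distribution of $(X, Z_1, Z_2)$ is denoted by
$F_3 \in \Omega_3$, where $\Omega_3$ is a family of multivariate distributions on
$\mathbb{R}^{m+m_1+m_2}$. The overall covariance matrix of $F_3$ is denoted by

$$\Sigma = \begin{bmatrix} \Sigma_X & \Sigma_{XZ_1} & \Sigma_{XZ_2} \\ \Sigma_{Z_1X} & \Sigma_{Z_1} & \Sigma_{Z_1Z_2} \\ \Sigma_{Z_2X} & \Sigma_{Z_2Z_1} & \Sigma_{Z_2} \end{bmatrix} \in \mathbb{R}^{(m+m_1+m_2) \times (m+m_1+m_2)}.$$

The overall covariance matrix, along with the individual $\Sigma_X$, $\Sigma_{Z_1}$ and $\Sigma_{Z_2}$,
are all assumed finite and positive semi-definite with rank no less than $d$.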

We can consider GCCA/CCA either with the population covariances or with the sample
covariances. For our theoretical analysis we consider the population covariances directly, while in
the numerical section we use the sample covariances, which are asymptotically equivalent under
standard regularity conditions [1].

Identifying the CCA projection $A_2 = A_2(X, Z_1)$ can be approached as the problem of finding
two sets of unit-length canonical vectors $\{a_i\}$ and $\{b_i\}$ to maximize the correlation between
$a_i'X$ and $b_i'Z_1$ for each $i = 1, \ldots, d$. (The size of $a_i$ is $m \times 1$ and the size of
$b_i$ is $m_1 \times 1$.) That is, we wish to identify

$$\arg\max_{a_i, b_i} \rho\{a_i'X, b_i'Z_1\} = \frac{a_i' \Sigma_{XZ_1} b_i}{\sqrt{a_i' \Sigma_X a_i}\,\sqrt{b_i' \Sigma_{Z_1} b_i}},$$

subject to the uncorrelated constraints

$$\rho\{a_i'X, a_j'X\} = \frac{a_i' \Sigma_X a_j}{\sqrt{a_i' \Sigma_X a_i}\,\sqrt{a_j' \Sigma_X a_j}} = 0 \quad \text{and} \quad \rho\{b_i'Z_1, b_j'Z_1\} = \frac{b_i' \Sigma_{Z_1} b_j}{\sqrt{b_i' \Sigma_{Z_1} b_i}\,\sqrt{b_j' \Sigma_{Z_1} b_j}} = 0, \quad \forall j < i. \qquad (1)$$

Then the $m \times d$ matrix $A_2 = [a_1, \ldots, a_d]$ is the CCA projection matrix for $X$, and
$A_2'X \in \mathbb{R}^d$ is the projected feature vector. Alternatively, a different $A_2 = A_2(X, Z_2)$
can be identified. Note that the arguments to $A_2$, namely $(X, Z_1)$ or $(X, Z_2)$, represent the
choice of auxiliary features, and will be suppressed if the choice is clear or irrelevant in the context.
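
As an illustration of the CCA formulation, the following is a minimal sketch that computes unit-length canonical vectors for $X$ from population (or sample) covariance blocks by whitening and a singular value decomposition. It assumes the two marginal covariance matrices are nonsingular; the names `inv_sqrt` and `cca_projection` are hypothetical and are not taken from the paper, and this is not necessarily the algorithm used in the experiments.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_projection(Sx, Sz, Sxz, d):
    """Return (A, rho): an m x d matrix of unit-length canonical vectors for X
    and the first d canonical correlations. Assumes Sx and Sz are nonsingular."""
    Kx, Kz = inv_sqrt(Sx), inv_sqrt(Sz)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, _ = np.linalg.svd(Kx @ Sxz @ Kz)
    A = Kx @ U[:, :d]                      # columns satisfy a_i' Sx a_j = 0 for i != j
    A = A / np.linalg.norm(A, axis=0)      # rescale each column to unit length
    return A, s[:d]
```

Rescaling the columns changes neither the correlations nor the uncorrelated constraints, which is why the unit-length convention can be imposed after the decomposition.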

To identify the GCCA projection $A_3$ based on $(X, Z_1, Z_2)$, we are looking for three sets of
unit-length canonical vectors $\{a_i\}$, $\{b_i\}$ and $\{c_i\}$ as follows:

$$\arg\max_{a_i, b_i, c_i} \; \rho^r\{a_i'X, b_i'Z_1\} + \rho^r\{a_i'X, c_i'Z_2\} + \rho^r\{b_i'Z_1, c_i'Z_2\}, \qquad (2)$$

$$\text{subject to } \rho\{a_i'X, a_j'X\} = \rho\{b_i'Z_1, b_j'Z_1\} = \rho\{c_i'Z_2, c_j'Z_2\} = 0, \quad \forall j < i.$$

Then $A_3 = [a_1, \ldots, a_d]$ is the desired GCCA projection. In general, given $F_{S+1}$ we can derive
the GCCA projection $A_{s+1}$ for any $1 \le s \le S$, and CCA is merely a special case for $s = 1$. The
exponent $r$ in the GCCA formulation (2) indicates the specific GCCA criterion, and a common
practice is to set $r = 1$ or $2$, which maximizes either the sum of correlations or the sum of squared
correlations [11]. Because our results are shown to hold for any $r \ge 1$, we implicitly take $r = 1$
unless mentioned otherwise.

Given $\Sigma_X$, we shall call an $m \times d$ matrix $A = [a_1, \ldots, a_d]$ a "potential" GCCA
projection if and only if its columns $\{a_i\}$ are of unit length and satisfy the uncorrelated
constraints. The set containing all potential GCCA projections is denoted by
$\mathcal{A} = \{A \mid \rho\{a_i'X, a_j'X\} = 0 \ \forall i \neq j \text{ and } \|a_i\| = 1 \ \forall i\}$.
As a different choice of auxiliary features yields a different projection, we denote the set
containing the GCCA projections $A_3$ by $\mathcal{A}_3$ and the set containing all CCA projections $A_2$
by $\mathcal{A}_2$, as well as the set $\mathcal{A}_{s+1}$ in general. Clearly the elements of
$\mathcal{A}_{s+1}$ as well as $\mathcal{A}$ depend on $\Sigma_X$. Note that the PCA projection is also
an element of $\mathcal{A}$, but this is not of any concern in this paper. An important special case:
$\mathcal{A}$ represents the Stiefel manifold [3] (containing all orthogonal projections onto
dimension $d$ linear subspaces) when $\Sigma_X$ is a multiple of the identity.

Note that the original GCCA/CCA algorithm does not require the norm of $a_i$ to be the same
for all $i$. We choose them to be unit-length consistently in order to avoid scaling issues in the
classification step (alternatively, it is a common practice to set $a_i'\Sigma_X a_i = 1$ for all $i$,
which is equivalent for our purposes). Also note that the choice of the GCCA/CCA projections can be
arbitrary. For example, let $\Sigma_X$ and $\Sigma_{Z_1}$ be identity matrices and all the singular
values of $\Sigma_{XZ_1}$ be the same; then $A_2(X, Z_1)$ can be chosen arbitrarily in the Stiefel
manifold $V_{d,m}$. In this case $A_2$ has $md - \frac{d(d+1)}{2}$ degrees of freedom, where $md$ comes
from the dimension freedom by repeating singular values and $\frac{d(d+1)}{2}$ comes from the
unit-length requirement and uncorrelated constraints. But if $\Sigma_{XZ_1}$ does not have repeating
singular values, $A_2$ represents a fixed subspace and has $d^2 - \frac{d(d+1)}{2}$ degrees of freedom,
which is implied by the fact that two $m \times d$ matrices $A$ and $B$ represent the same subspace
if and only if $AA' = BB'$. The same phenomenon applies for any GCCA projection $A_{s+1}$.
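
The subspace criterion above can be checked numerically. The sketch below assumes both matrices have orthonormal columns (the case where $\Sigma_X$ is a multiple of the identity); the helper name `same_subspace` is hypothetical.

```python
import numpy as np

def same_subspace(A, B, tol=1e-10):
    """True if the m x d matrices A and B (orthonormal columns assumed)
    satisfy A A' = B B', i.e. they represent the same subspace."""
    return np.allclose(A @ A.T, B @ B.T, atol=tol)

# Two bases of the span of the first two coordinate axes in R^4.
A = np.eye(4)[:, :2]
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
B = A @ R                    # rotated basis of the same subspace
C = np.eye(4)[:, 1:3]        # a different two-dimensional subspace
```

Here `same_subspace(A, B)` holds while `same_subspace(A, C)` does not, reflecting that a rotation within the subspace leaves $AA'$ unchanged.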



Returning to the classification problem: given a classifier $g : \mathbb{R}^d \to \{1, \ldots, K\}$ for
the low-dimensional feature vector $A'X$, the error $L_A$ may differ for different $A \in \mathcal{A}$.
Clearly $\mathcal{A}$ is compact for finite $\Sigma_X$ and $\{L_A \mid A \in \mathcal{A}\}$ is bounded
within $[0, 1]$, but an optimal low-dimensional projection (with respect to the classification error) is
not guaranteed to exist. We make the following assumption to avoid non-existence:

Assumption 1. Given a classifier $g$, we assume for the theory in the sequel that an optimal
projection $A^* = \arg\min_{A \in \mathcal{A}} \{L_A\}$ exists for any finite $\Sigma_X$ of rank at least $d$.

For example, if the class-conditional distributions $F_{X|Y=k}$ admit probability density functions
$f_{X|Y=k}$ for $k = 1, \ldots, K$, then the assumption holds. (In this case $L_A$ is continuous with
respect to $A$, and thus $\{L_A \mid A \in \mathcal{A}\}$ is compact and admits a minimum.)

By this assumption, the minimum error $L_{A^*}$ always exists and it follows that
$L_{A_{s+1}} \ge L_{A^*}$ always holds for any $s$. Note that the optimal projection $A^*$ need not be
unique, since the existence suffices for our purposes. Now we are able to define the notion that GCCA
improves CCA using $L_{A^*}$.

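Definition 1. Assuming the existence of $A^*$, we say GCCA improves CCA within a family of
distributions $\Omega_3$ if and only if
$\{F_3 \in \Omega_3 \mid L_{A_2} = L_{A^*}, \forall A_2 \in \mathcal{A}_2\} \subset \{F_3 \in \Omega_3 \mid L_{A_3} = L_{A^*}, \forall A_3 \in \mathcal{A}_3\}$.

In general, we say the set of GCCA projections $\mathcal{A}_{s+1}$ improves the set of GCCA projections
$\mathcal{A}_{t+1}$ within $\Omega_{S+1}$ ($1 \le s, t \le S$) if and only if
$\{F_{S+1} \in \Omega_{S+1} \mid L_{A_{t+1}} = L_{A^*}, \forall A_{t+1} \in \mathcal{A}_{t+1}\} \subset \{F_{S+1} \in \Omega_{S+1} \mid L_{A_{s+1}} = L_{A^*}, \forall A_{s+1} \in \mathcal{A}_{s+1}\}$.
(Here the notation "$\subset$" indicates proper subset.)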

Put in words, suppose GCCA improves CCA within $\Omega_3$. Then the optimality of both CCA
projections implies the optimality of the GCCA projection, and there exists $F_3$ such that the GCCA
projection is optimal while at least one of the CCA projections is not. Note that this is not
equivalent to $L_{A_3} \le L_{A_2}$.

If $\Omega_3$ includes every possible multivariate distribution, then GCCA fails to improve CCA. For
example, if $Z_1$ and $Z_2$ are both positively correlated to $X$ but $Z_1$ and $Z_2$ are negatively
correlated, then it might happen that $A_2$ is optimal while $A_3$ is not. Hence we look for a family
$\Omega_3$ imposing certain relationships among $X$ and $\{Z_s\}$ such that GCCA is guaranteed to
improve CCA.

First, we transform $X$ by centering and whitening, so the population mean is zero and the
population covariance matrix becomes the identity matrix. Then $\mathcal{A}$ consists of orthogonal
projections onto dimension $d$ linear subspaces, and there exists an orthogonal matrix such that the
feature vector can be rotated to guarantee $A^*$ is equivalent to the subspace $\mathbb{R}^d$ spanned by
the first $d$ coordinate axes. We denote the transformed random variable by
$\tilde{X} = H_X(X - E(X))$, where $E(X)$ is the expectation for centering and $H_X$ is a non-singular
$m \times m$ matrix for whitening and rotation. Since the optimal projection for $\tilde{X}$ is spanned
by the first $d$ coordinate axes, the form of $\tilde{X}$ based on the class label
$Y \in \{1, \ldots, K\}$ can be expressed as:

$$\tilde{X} = H_X(X - E(X)) \overset{law}{=} \begin{bmatrix} U_1 I_1 + U_2 I_2 + \cdots + U_K I_K \\ W \end{bmatrix}, \qquad (3)$$

where $I_k$ is the class label indicator of class $k$, taking the value 1 with probability $p_k$, and
$\sum_{k=1}^{K} p_k = 1$; each $U_k \in \mathbb{R}^d$ is the marginal distribution of $\tilde{X}$ under
class $k$, and $W \in \mathbb{R}^{m-d}$ is the "irrelevant" marginal of $\tilde{X}$. By the above
transformation it holds that $E(W) = 0_{(m-d) \times 1}$ and $E(WW') = I_{(m-d) \times (m-d)}$.
Clearly $H_X$ always exists, and there are multiple choices for $H_X$ if $A^*$ is not unique. Now we
impose our conditions on $F_{S+1}$ and define what we call the similar family.
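
The centering and whitening part of this transformation can be sketched as follows for a finite sample. This is only an illustration under stated assumptions: the sample covariance is taken to be nonsingular, the symmetric inverse square root is one of many valid whitening matrices, and the additional rotation that aligns $A^*$ with the first $d$ coordinate axes is omitted because $A^*$ is generally unknown. The function names are hypothetical.

```python
import numpy as np

def whitening_matrix(Sx):
    """Return a symmetric H with H Sx H' = I, assuming Sx is positive definite."""
    w, V = np.linalg.eigh(Sx)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def center_and_whiten(X):
    """Rows of X are observations. Returns the transformed sample and H."""
    Xc = X - X.mean(axis=0)                          # centering
    H = whitening_matrix(np.cov(Xc, rowvar=False))   # whitening (no rotation)
    return Xc @ H.T, H
```

After the transformation the sample mean is zero and the sample covariance is the identity, so the set of potential projections reduces to orthonormal matrices.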

Definition 2. We say the family of distributions $\Omega^*_{S+1}$ is the similar family if and only if it
includes every $F_{S+1}$ such that $(X, Z_1, \ldots, Z_S) \sim F_{S+1}$ satisfies the following conditions:

Condition (1): For each $A^*$, there exist non-singular matrices $H_X \in \mathbb{R}^{m \times m}$ and
$H_{Z_s} \in \mathbb{R}^{m_s \times m_s}$ for all $s = 1, \ldots, S$, such that Equation (3) holds and
there exist non-negative scalars $q_{sk}$ with

$$\tilde{Z}_s = H_{Z_s}(Z_s - E(Z_s)) \overset{law}{=} \begin{bmatrix} q_{s1} U_1 I_1 + q_{s2} U_2 I_2 + \cdots + q_{sK} U_K I_K + e_s \\ W_s \end{bmatrix}, \qquad (4)$$

where $e_s$ represents independent noise and $W_s \in \mathbb{R}^{m_s - d}$. Note that unlike $H_X$,
$H_{Z_s}$ need only be non-singular and $Z_s$ are not necessarily whitened and rotated.

Condition (2): $E(U_k U_k') = I$, and $U_k$ is uncorrelated with $W$ and $W_s$, for all
$k = 1, \ldots, K$ and $s = 1, \ldots, S$.

Condition (3): $\sigma_1(E(W_s W_t')) \le \sigma_1(E(W W_s'))\,\sigma_1(E(W W_t'))$ for all
$1 \le s \neq t \le S$, where we denote by $\sigma_i(\Sigma)$ the $i$th largest singular value of any
matrix $\Sigma$ henceforth.

Condition (4): $(q_{sk_1} - q_{sk_2})(q_{tk_1} - q_{tk_2}) \ge 0$ for all $1 \le s < t \le S$ and
$k_1, k_2 = 1, \ldots, K$; namely the ordering of the coefficients $q_{sk}$ is consistent throughout $Z_s$.

The purpose of condition (1) is to guarantee that the marginal distribution restricted to $A^*$
of every transformed auxiliary feature under each class is a scalar multiple of the corresponding
marginal of $X$ plus error. The possible non-uniqueness of $A^*$ is (mostly) avoided by requiring
(1) to hold for any $A^*$, though the transformation matrices and respective scalars probably differ
under different $A^*$. Condition (2) simplifies the analysis; without it the proof is much more
complex. Given conditions (1) and (2), conditions (3) and (4) are technical conditions used in the
proof.

3. Main Results 

Theorem 1. GCCA improves CCA in the similar family $\Omega_3^*$.

Therefore it is beneficial to use the GCCA projection $A_3$ within the similar family $\Omega_3^*$.
Furthermore, the similar family can be decomposed into three disjoint subsets as follows:
$\Omega_3^* = \{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} = L_{A_3} = L_{A^*}\} \cup \{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} > L_{A_3} = L_{A^*}\} \cup \{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} > L_{A^*} \text{ and } L_{A_3} > L_{A^*}\}$,
with all the subsets shown to be non-empty and proper in the proof (we can also replace all the
'max' by 'min'). Specifically, if the optimal $A^*$ is known (which may be difficult in practice), then
one can check which subset a given $F_3 \in \Omega_3^*$ belongs to according to Inequality (6) and
Inequality (7) in the proof below. When the distribution lies in the first or the second subset
above, the GCCA projection performs no worse than the CCA projections.

The above theorem can be further generalized to $\Omega^*_{S+1}$.

Corollary 1. For any $S \ge s \ge 2$, the set of GCCA projections $\mathcal{A}_{s+1}$ improves the set of
CCA projections $\mathcal{A}_2$ in the similar family $\Omega^*_{S+1}$.

Under a simplified setting, we can also show that the set of GCCA projections continues to
improve when more auxiliary features are included in deriving the projections.

Corollary 2. Let us replace condition (4) by a simplifying condition (4*): $W_s = W_t$ and
$q_{sk} = q_{tk}$ for all $1 \le s, t \le S$. Namely, the auxiliary features follow the same distribution
for $s = 1, \ldots, S$.

Then for any $S \ge s \ge 2$, the set of GCCA projections $\mathcal{A}_{s+1}$ always improves the set of
GCCA projections $\mathcal{A}_s$ in the similar family $\Omega^*_{S+1}$.




4. Numerical Experiments 

To investigate the performance of the GCCA/CCA projections in classification, we present both 
numerical simulation and a real data experiment. We use sample covariances to derive the GCCA 
projections (the algorithm in use is based on [13]) and supervised learning for classification, for 
which LDA (Linear Discriminant Analysis) [5] is the classification rule. 

4.1. Numerical Simulation

We start with four random variables $U_1, U_2 \in \mathbb{R}^3$ and $V_1, V_2 \in \mathbb{R}^6$, all
independently normally distributed. The parameters are set as follows:
$E(U_1 U_1') = E(U_2 U_2') = I_{3 \times 3}$, $E(U_1) = -E(U_2) = 0.2_{3 \times 1}$,
$E(V_1 V_1') = E(V_2 V_2') = 0.5 I_{6 \times 6}$, $E(V_1) = E(V_2) = 0_{6 \times 1}$.

The three random variables $X, Z_1, Z_2 \in \mathbb{R}^9$ are constructed as follows:

$$X \overset{law}{=} \begin{bmatrix} U_1 I_1 + U_2 I_2 \\ V_1 + V_2 \end{bmatrix}, \quad Z_1 \overset{law}{=} \begin{bmatrix} 0.5 U_1 I_1 + 0.5 U_2 I_2 + e_1 \\ V_1 + e_3 \end{bmatrix}, \quad Z_2 \overset{law}{=} \begin{bmatrix} 0.5 U_1 I_1 + 0.5 U_2 I_2 + e_2 \\ V_2 + e_4 \end{bmatrix}, \qquad (5)$$

where $e_1, e_2 \sim \mathcal{N}(0, 0.75 I_{3 \times 3})$, $e_3, e_4 \sim \mathcal{N}(0, 0.5 I_{6 \times 6})$,
and $I_1$ and $I_2$ are class label indicators having equal probability. Using LDA, it is clear that at
$d = 3$ the optimal projection $A^*$ uniquely represents the subspace spanned by the first $d$ coordinate
axes. Hence we can fit the joint distribution into Definition 2 with $d = 3$, such that
$q_{11} = q_{12} = q_{21} = q_{22} = 0.5$, $W = V_1 + V_2$, $W_1 = V_1 + e_3$, $W_2 = V_2 + e_4$, etc.
This joint distribution is easily checked to satisfy the required conditions, so it belongs to
$\Omega_3^*$. Further, by checking Inequality (6) and Inequality (7) in the proof, the joint
distribution is actually an element of the subset
$\{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} > L_{A_3} = L_{A^*}\} \subset \Omega_3^*$. So we
expect GCCA to outperform CCA when projected onto $\mathbb{R}^3$. Note that in this case we can
explicitly calculate $L_{A^*}$ for the population model, which is 36.45%.
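
The membership claim and the quoted optimal error can be checked with a short calculation. The sketch below evaluates the quantities $h_1$, $h_2$ (Inequality (6)) and $h$ (Inequality (7)) from Section 5 for this population model, and the error of the optimal linear rule for two equally likely Gaussian classes. It assumes that each $U_k$ has identity covariance with means $\pm 0.2 \cdot 1_{3 \times 1}$, an assumption that reproduces the quoted 36.45%; the variable names are illustrative.

```python
import math
import numpy as np

d, p = 3, 0.5
q = np.full((2, 2), 0.5)             # q[s-1, k-1] = q_sk

# Cross-covariances of the "irrelevant" parts, using W = V1 + V2,
# W1 = V1 + e3, W2 = V2 + e4, with V1, V2, e3, e4 independent and zero mean.
E_WW1 = 0.5 * np.eye(6)              # E(W W1') = E(V1 V1')
E_WW2 = 0.5 * np.eye(6)              # E(W W2') = E(V2 V2')
E_W1W2 = np.zeros((6, 6))            # E(W1 W2') = 0 by independence

def sigma1(M):
    """Largest singular value of M."""
    return np.linalg.svd(M, compute_uv=False)[0]

h1 = p * q[0, 0] + (1 - p) * q[0, 1] - sigma1(E_WW1)
h2 = p * q[1, 0] + (1 - p) * q[1, 1] - sigma1(E_WW2)
h = p * q[0, 0] * q[1, 0] + (1 - p) * q[0, 1] * q[1, 1] - sigma1(E_W1W2)

# Error of the optimal linear rule for N(mu, I) versus N(-mu, I), equal priors:
# Phi(-||mu||), written with the complementary error function.
mu = 0.2 * np.ones(d)
optimal_error = 0.5 * math.erfc(np.linalg.norm(mu) / math.sqrt(2))
```

Under these assumptions `h1` and `h2` are zero up to rounding, so Inequality (6) fails for both CCA projections, while `h + h1 + h2` is about 0.25 and Inequality (7) holds; `optimal_error` evaluates to roughly 0.3645.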

For each Monte Carlo replicate, $n = 1500$ observations are generated for each random variable.
That is, $\{x^{(1)}, \ldots, x^{(1500)}\}$ for $X$, $\{z_1^{(1)}, \ldots, z_1^{(1500)}\}$ for $Z_1$ and
$\{z_2^{(1)}, \ldots, z_2^{(1500)}\}$ for $Z_2$. All data points are used to learn the GCCA/CCA
projections $A_3$/$A_2$ respectively for $d = 3$. Then the first 1000 points generated from $X$ are
projected and used to train the classifier; the remaining 500 points are projected and used for
classification error testing. The classification error is recorded separately for the CCA projections
$A_2(X, Z_1)$ and $A_2(X, Z_2)$ and for the GCCA projections $A_3$, using both the sum of correlations
($r = 1$) and the sum of squared correlations ($r = 2$) criteria. The above is done for 500 Monte Carlo
replications, with the average classification error and standard deviation shown in Table 1 for each
projection. The GCCA classification error is lower than CCA as expected, and is fairly close to the
optimal error $L_{A^*}$.

projections            CCA on (X, Z_1)   CCA on (X, Z_2)   GCCA (r = 1)   GCCA (r = 2)
average error (L_A)    41.14%            41.33%            37.21%         38.09%
standard deviation     0.15%             0.16%             0.10%          0.12%

Table 1: Simulation Results

topic            category   people   locations   date   math
class label      1          2        3           4      5
article number   119        372      270         191    430

Table 2: Wikipedia Dataset Topics



4.2. Real Data 

The real data experiment applies GCCA/CCA to text document classification. The dataset 
is obtained from Wikipedia, an open-source multilingual web-based encyclopedia with millions of 
articles in more than 280 languages. In Wikipedia each article can be related to others in the same 
language, or articles in other languages with the same subject. Articles of the same subject in 
different languages are not necessarily exact translations of one another; it is very likely they are 
written by different people and their contents might differ significantly. 

English articles within a 2-neighborhood of the English article "Algebraic Geometry" are
collected, and the corresponding French articles of those English documents are also collected, which
totals $n = 1382$ pairs of articles in English and French. Let $a_1^E, \ldots, a_{1382}^E$ denote the
English articles and $a_1^F, \ldots, a_{1382}^F$ denote the French articles. All articles are manually
labeled into 5 disjoint classes (1 to 5) based on their topics, as shown in Table 2.

For the purposes of GCCA/CCA, we first need to embed each article into the Euclidean space
$\mathbb{R}^m$ by Multidimensional Scaling (MDS). MDS [HI HI [2] strives to give a Euclidean
representation while approximately preserving the dissimilarities of the original data: given an
$n \times n$ dissimilarity matrix $\Delta = [\delta_{ij}]$ for $n$ observations, with $\delta_{ij}$ being
the dissimilarity measure between the $i$th and $j$th observations, MDS generates embeddings
$x_i \in \mathbb{R}^m$ for the $i$th data point to preserve the dissimilarity among the object pairs,
i.e. $\|x_i - x_j\| \approx \delta_{ij}$.
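
The embedding step can be sketched as follows. The text does not say which MDS variant is used, so this illustration assumes classical (Torgerson) MDS, which double-centers the squared dissimilarities and keeps the leading eigenvectors; the function name `classical_mds` is hypothetical.

```python
import numpy as np

def classical_mds(D, m):
    """Embed n objects into R^m from an n x n dissimilarity matrix D
    using classical (Torgerson) scaling."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered squared dissimilarities
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:m]             # indices of the m largest eigenvalues
    w_top = np.clip(w[idx], 0.0, None)        # negative eigenvalues are set to zero
    return V[:, idx] * np.sqrt(w_top)         # rows are the embedded points
```

When the dissimilarities are exact Euclidean distances of points in at most $m$ dimensions, the pairwise distances of the returned rows match the input; for general dissimilarities the match is only approximate.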

For our work two different types of dissimilarity measures are considered for English and French
articles, giving four dissimilarity matrices of dimension $1382 \times 1382$: the graph topology
dissimilarity matrices $\Delta^{GE}$, $\Delta^{GF}$ and the text content dissimilarity matrices
$\Delta^{TE}$, $\Delta^{TF}$.

                             Graph Topology Dissimilarity   Text Content Dissimilarity
English articles {a_i^E}     {x_i^E} (GE)                   {x_i^E} (TE)
French articles {a_i^F}      {x_i^F} (GF)                   {x_i^F} (TF)

Table 3: Euclidean Embeddings ($\mathbb{R}^m$) for Wikipedia Articles

For the graph dissimilarities, $\Delta^{GE}$ and $\Delta^{GF}$ are constructed based on an undirected
graph $G(V, E)$, where $V$ represents the set of vertices of the 1382 Wikipedia documents, and $E$ is the
set of edges connecting those articles. There is an edge between two vertices if they are linked in
Wikipedia. Then the entry $\Delta^{G}(i, j)$ is calculated from the number of steps on the shortest path
from document $i$ to document $j$ in $G$. For the English articles, $\Delta^{GE}(i, j) \in \{0, \ldots, 4\}$,
where the 4 comes from the 2-neighborhood document collection. For the French articles
$\Delta^{GF}(i, j)$ depends on the French graph connections, so it is possible that
$\Delta^{GF}(i, j) \neq \Delta^{GE}(i, j)$. At the extreme end, $\Delta^{GF}(i, j) = \infty$ when
$a_i^F$ and $a_j^F$ are not connected, and we set $\Delta^{GF}(i, j) = 6$ for $\Delta^{GF}(i, j) > 4$.
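
A minimal sketch of this construction is given below, assuming the link graph is supplied as an adjacency dictionary over vertices 0 to n-1 (a hypothetical input format). Shortest paths are found by breadth-first search, and any path longer than four steps, including the disconnected case, is assigned the value 6 as described above.

```python
from collections import deque
import numpy as np

def graph_dissimilarity(adj, max_len=4, cap=6):
    """adj: dict mapping each vertex 0..n-1 to an iterable of its neighbours
    in an undirected graph. Returns the n x n capped shortest-path matrix."""
    n = len(adj)
    D = np.full((n, n), float(cap))           # default: longer than max_len or unreachable
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_len:
                continue                      # anything farther is capped anyway
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, steps in dist.items():
            D[src, v] = steps
    return D
```

Stopping the search at depth four keeps the cost per source vertex proportional to the size of its 4-neighborhood rather than the whole graph.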

For the text dissimilarities, $\Delta^{TE}$ and $\Delta^{TF}$ are based on the text processing features
for documents $\{a_i^E\}$ and $\{a_i^F\}$. Suppose $z_i, z_j$ are the feature vectors for the $i$th and
$j$th English articles. Then $\Delta^{TE}(i, j)$ is calculated by the cosine dissimilarity
$\Delta^{TE}(i, j) = 1 - \frac{z_i \cdot z_j}{\|z_i\| \, \|z_j\|}$. For the experiment we
consider the latent semantic indexing (LSI) features [S].
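
The cosine dissimilarity can be computed for all pairs at once. The sketch below assumes the feature vectors are stacked as the rows of a matrix and that none of them is the zero vector; the function name is hypothetical.

```python
import numpy as np

def cosine_dissimilarity(Z):
    """Rows of Z are nonzero feature vectors. Entry (i, j) of the result
    is 1 - (z_i . z_j) / (||z_i|| ||z_j||)."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # normalize each row
    return 1.0 - Zn @ Zn.T
```

Vectors pointing in the same direction have dissimilarity 0 regardless of their lengths, while orthogonal vectors have dissimilarity 1.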

Once the different dissimilarity matrices are constructed, the Euclidean space embeddings with
$m = 50$ are obtained via MDS. The articles' embeddings are shown in Table 3. At first, English
graph dissimilarity (GE) is the classification target, and the others (GF, TE, TF) are treated as
auxiliary features: all data points are used to learn the GCCA/CCA projections from $\mathbb{R}^m$ to
$\mathbb{R}^d$ based on GE and a certain choice of auxiliary features, and the data points of GE are
projected by the learned projections. Then 600 observations are randomly picked to train the classifier,
with the remaining 782 documents used for classification error testing. We repeat 500 times to calculate
the average classification error, for every possible GCCA/CCA projection and various choices of $d$. The
same procedure is repeated with the French graph dissimilarity (GF) being the classification target and
the remaining being the auxiliary features. The full results for every possible projection are shown
in Figure 1 for the classification of GE. For illustration purposes, two simplified plots are shown
in Figure 2 for the classification of GE/GF, for which we omit most projections in order to better
quantify the effects of increasing $s$ (the number of chosen auxiliary features), i.e., only the best
$A_2$ and $A_3$ are shown. Note that for comparison purposes the PCA projections are also included.

Figure 1: Classification Error for GE

Based on Figure 2, we observe that for most choices of $d$ the best GCCA projection $A_3$ admits a
lower error than the best CCA projection $A_2$, and both of them are better than the PCA projection.
However, it turns out that the GCCA projection $A_4$ is much worse for classifying the Wikipedia
data. This is not a surprise, as one can judge from Figure 1 that the choice of auxiliary features is
crucial for the performance of GCCA/CCA projections. From a qualitative perspective, the graph
dissimilarities GE and GF are of questionable value because they depend on the Internet links,
while the text dissimilarities TE and TF are much more faithful because they are extracted from
the document contents. Therefore it is reasonable to believe that choosing a text dissimilarity is
better than choosing a graph dissimilarity, which explains why the best $A_2$ and $A_3$ do not choose
any graph dissimilarity and why $A_4$ performs worse.

Unfortunately, it is not easy to check the joint distribution against Definition 2 because the optimal
projection $A^*$ is unknown. (Even if $A^*$ is known, it is likely the conditions are not satisfied.)
Therefore in a real-world application, one must be cautious in adding an auxiliary feature to derive
the projection, which can be a trial-and-error process.



5. Proofs 

5.1. Proof of Theorem 1 when $K = 2$ and $r = 1$

Proof. We consider $K = 2$ and $r = 1$ here (and generalize in the next proof), so the number of
classes is two and the GCCA criterion is the sum of correlations.

If a projection $A$ represents the same subspace as the optimal projection $A^*$ (i.e.,
$AA' = A^*A^{*\prime}$), then $A$ is optimal for classification such that $L_A = L_{A^*}$. For most
parts it suffices to assume $A^*$ is unique (in the sense of representing the same subspace), which is
justified towards the end of the proof.

In addition to the uniqueness of $A^*$, we also assume that $H_X$, $H_{Z_s}$, $\Sigma_{Z_s}$ are all
identity matrices for $s = 1, 2$. This is also justified later, as we will show the theorem is invariant
under proper transformations. Further, the expectations $E(X)$ and $E(Z_s)$ are treated as zeros
throughout all proofs because the GCCA/CCA projections and the classification task are not affected.

Under the above assumptions, we have the following: the optimal projection $A^*$ is spanned by
the first $d$ coordinate axes; any potential projection $A \in \mathcal{A}$ must be orthonormal and
equivalent to an orthogonal projection onto a dimension $d$ linear subspace; and the GCCA/CCA
projections $A_{s+1}$ are optimal if and only if $A_{s+1}A_{s+1}' = A^*A^{*\prime}$.

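Because all the pre-multiplication matrices are assumed to be identity matrices, together with
conditions (1) and (2) in Definition 2 we have the covariance matrices (writing $p = p_1$, so that
$p_2 = 1 - p$)

$$\Sigma_{XZ_1} = \begin{bmatrix} p q_{11} E(U_1 U_1') + (1-p) q_{12} E(U_2 U_2') & p E(U_1 W_1') + (1-p) E(U_2 W_1') \\ p q_{11} E(W U_1') + (1-p) q_{12} E(W U_2') & E(W W_1') \end{bmatrix} = \begin{bmatrix} (p q_{11} + (1-p) q_{12}) I_{d \times d} & 0 \\ 0 & E(W W_1') \end{bmatrix},$$

$$\Sigma_{XZ_2} = \begin{bmatrix} p q_{21} E(U_1 U_1') + (1-p) q_{22} E(U_2 U_2') & p E(U_1 W_2') + (1-p) E(U_2 W_2') \\ p q_{21} E(W U_1') + (1-p) q_{22} E(W U_2') & E(W W_2') \end{bmatrix} = \begin{bmatrix} (p q_{21} + (1-p) q_{22}) I_{d \times d} & 0 \\ 0 & E(W W_2') \end{bmatrix}.$$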

To derive the CCA projection $A_2 = A_2(X, Z_1)$, the two $m \times d$ orthonormal matrices $A_2$ and
$B_2$ shall maximize the singular values of $A_2' \Sigma_{XZ_1} B_2$ (we take $B_2 = [b_1, \ldots, b_d]$
as in Equation (1), similarly to how we define $A_2$) [9]. Because $A^*$ represents the dimension $d$
subspace spanned by the first $d$ coordinate axes, $A_2(X, Z_1)$ is optimal if and only if $A_2$ consists
of the first $d$ left singular vectors of $\Sigma_{XZ_1}$. Due to the form of $\Sigma_{XZ_1}$, in this
case $B_2$ must consist of the first $d$ right singular vectors and the respective correlations are
maximized to the decreasingly ordered singular values of the $d \times d$ leading principal sub-matrix of
$\Sigma_{XZ_1}$. Therefore $A_2 A_2' = A^* A^{*\prime}$ if and only if $A_2$ is spanned by the first $d$
coordinate axes, or equivalently the largest $d$ singular values of $\Sigma_{XZ_1}$ all come from the
$d \times d$ leading principal sub-matrix.

Put into inequalities, the CCA projections $A_2(X, Z_s)$ are optimal if and only if

$$h_s = p q_{s1} + (1-p) q_{s2} - \sigma_1(E(W W_s')) > 0. \qquad (6)$$

When either CCA projection is not optimal, at least one $h_s$ is non-positive and represents the
"singular value loss" of using CCA.

To derive the GCCA projection $A_3$ based on $(X, Z_1, Z_2)$, the covariance matrix between $Z_1$ and
$Z_2$ also comes into play:

$$\Sigma_{Z_1Z_2} = \begin{bmatrix} p q_{11} q_{21} E(U_1 U_1') + (1-p) q_{12} q_{22} E(U_2 U_2') & p q_{11} E(U_1 W_2') + (1-p) q_{12} E(U_2 W_2') \\ p q_{21} E(W_1 U_1') + (1-p) q_{22} E(W_1 U_2') & E(W_1 W_2') \end{bmatrix} = \begin{bmatrix} (p q_{11} q_{21} + (1-p) q_{12} q_{22}) I_{d \times d} & 0 \\ 0 & E(W_1 W_2') \end{bmatrix}.$$

Arguing in a similar manner, the GCCA projection is optimal if and only if $A_3$ is spanned by the
first $d$ coordinate axes. The necessary and sufficient condition for that is

$$h + h_1 + h_2 > 0, \qquad (7)$$

where we define $h = p q_{11} q_{21} + (1-p) q_{12} q_{22} - \sigma_1(E(W_1 W_2'))$. In words, if both
CCA projections are already optimal, it is sufficient that the largest $d$ singular values of
$\Sigma_{Z_1Z_2}$ all come from the $d \times d$ leading principal sub-matrix; else if either CCA
projection is not optimal, the "singular value gain" from $\Sigma_{Z_1Z_2}$ has to compensate the
possible "singular value loss" from $\Sigma_{XZ_1}$ and $\Sigma_{XZ_2}$ in order for the GCCA projection
to be optimal.

An important step is to prove that if $h_s \ge 0$ for $s = 1, 2$, then $h \ge 0$. This is true because

$$\begin{aligned} h &= p q_{11} q_{21} + (1-p) q_{12} q_{22} - \sigma_1(E(W_1 W_2')) \\ &\ge p q_{11} q_{21} + (1-p) q_{12} q_{22} - \sigma_1(E(W W_1'))\,\sigma_1(E(W W_2')) \\ &\ge p q_{11} q_{21} + (1-p) q_{12} q_{22} - (p q_{11} + (1-p) q_{12})(p q_{21} + (1-p) q_{22}) \\ &= p(1-p)(q_{11} - q_{12})(q_{21} - q_{22}) \\ &\ge 0, \end{aligned}$$

where the first inequality uses condition (3) in Definition 2, the second inequality is by the fact that
$h_s \ge 0$, and the last inequality uses condition (4).
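
The equality step in the chain above is a purely algebraic identity in $p$ and the four coefficients, and it can be spot-checked numerically. The sketch below is only a sanity check under randomly drawn values, not part of the proof; the function name is hypothetical.

```python
import random

def identity_gap(p, q11, q12, q21, q22):
    """Difference between the two sides of the equality step:
    p q11 q21 + (1-p) q12 q22 - (p q11 + (1-p) q12)(p q21 + (1-p) q22)
    versus p (1-p) (q11 - q12) (q21 - q22)."""
    lhs = p * q11 * q21 + (1 - p) * q12 * q22 \
        - (p * q11 + (1 - p) * q12) * (p * q21 + (1 - p) * q22)
    rhs = p * (1 - p) * (q11 - q12) * (q21 - q22)
    return lhs - rhs

random.seed(0)
gaps = [identity_gap(random.random(), *(random.random() for _ in range(4)))
        for _ in range(1000)]
```

Every entry of `gaps` is zero up to floating-point rounding, which agrees with expanding the product by hand.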

By the above derivation, if both CCA projections are optimal such that $h_s > 0$ for $s = 1, 2$, then
Inequality (7) automatically holds and the GCCA projection $A_3$ is also optimal. This shows that
any $F_3 \in \Omega_3^*$ satisfying Inequality (6) for $s = 1, 2$ is an element of the subset
$\{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} = L_{A_3} = L_{A^*}\}$.

Next we show there exists $F_3 \in \Omega_3^*$ such that Inequality (7) holds while Inequality (6) fails
for at least one $s$. The trivial example is that if $h_1 = h_2 = 0$, then the GCCA projection is
optimal. Furthermore, fixing $h$, $p$ and all the $q_{sk}$, the left-hand side of Inequality (7) is
clearly continuous with respect to $\sigma_1(E(W W_s'))$ for each $s$. This means
$\sigma_1(E(W W_s'))$ can be increased such that $h_s \le 0$ (and condition (3) in Definition 2 will not
be violated) while Inequality (7) still holds. So there also exists $F_3$ such that the GCCA projection
is optimal when $h_s < 0$. Thus
$\exists F_3 \in \{F_3 \in \Omega_3^* \mid \max\{L_{A_2}\} > L_{A_3} = L_{A^*}\}$.






Therefore, when $A^*$ is unique and $H_X$, $H_{Z_s}$, $\Sigma_{Z_s}$ are all identity matrices, we have
proved that: for any given $F_3 \in \Omega_3^*$, if the CCA projections are optimal, so is the GCCA
projection; if the CCA projections are not optimal (Inequality (6) is not satisfied for at least one
$s$), the GCCA projection may be optimal (depending on whether the covariance structure satisfies
Inequality (7)). Equivalently, we have demonstrated that the similarity definition is sufficient for GCCA
to improve CCA. Note that the step that ensures $h \ge 0$ when $h_s \ge 0$ will be used again.

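Next we show the result so far is invariant under any $H_X$, $H_{Z_s}$, $\Sigma_{Z_s}$ that satisfy
Definition 2. Take CCA on $(X, Z_1)$ for an example: by Equation (3) and Equation (4) we have
$\Sigma_{\tilde{X}} = H_X \Sigma_X H_X' = I$ and $\Sigma_{\tilde{Z}_1} = H_{Z_1} \Sigma_{Z_1} H_{Z_1}'$;
also by eigen-decomposition there exists an $m_1 \times m_1$ matrix $V$ s.t.
$\Sigma_{\tilde{Z}_1} = V'V$. Then $\Sigma_X = H_X^{-1} H_X^{-1\prime}$ and
$\Sigma_{Z_1} = H_{Z_1}^{-1} V' (H_{Z_1}^{-1} V')'$, and the CCA formulation (1) is equivalent to

$$\rho\{a_i'X, b_i'Z_1\} = \frac{(H_X^{-1\prime} a_i)' \, H_X \Sigma_{XZ_1} H_{Z_1}' V^{-1} \, (V H_{Z_1}^{-1\prime} b_i)}{\sqrt{(H_X^{-1\prime} a_i)' H_X^{-1\prime} a_i}\,\sqrt{(V H_{Z_1}^{-1\prime} b_i)' V H_{Z_1}^{-1\prime} b_i}}$$

$$\text{subject to } \rho\{a_i'X, a_j'X\} = \frac{(H_X^{-1\prime} a_i)' H_X^{-1\prime} a_j}{\sqrt{(H_X^{-1\prime} a_i)' H_X^{-1\prime} a_i}\,\sqrt{(H_X^{-1\prime} a_j)' H_X^{-1\prime} a_j}} = 0$$

$$\text{and } \rho\{b_i'Z_1, b_j'Z_1\} = \frac{(V H_{Z_1}^{-1\prime} b_i)' V H_{Z_1}^{-1\prime} b_j}{\sqrt{(V H_{Z_1}^{-1\prime} b_i)' V H_{Z_1}^{-1\prime} b_i}\,\sqrt{(V H_{Z_1}^{-1\prime} b_j)' V H_{Z_1}^{-1\prime} b_j}} = 0,$$

where $V^{-1}$ is defined as the unique Moore-Penrose pseudo-inverse if $\Sigma_{\tilde{Z}_1}$ is
singular. Hence it is equivalent to consider the projections $H_X^{-1\prime} A_2$ and
$V H_{Z_1}^{-1\prime} B_2$ on $(\tilde{X}, V'^{-1} \tilde{Z}_1)$ (both $\tilde{X}$ and
$V'^{-1} \tilde{Z}_1$ are of identity variance) with covariance $H_X \Sigma_{XZ_1} H_{Z_1}' V^{-1}$,
instead of the projections $A_2$ and $B_2$ on $(X, Z_1)$. The same holds for the GCCA formulation (2).
Furthermore, the classification task remains the same because the projected feature
$A'X = (H_X^{-1\prime} A)' H_X X$ is invariant under the full-rank transformation $H_X$. Therefore the
optimal projection $A^*$ and the GCCA/CCA projections $A_{s+1}$ are all equivalent to the identity
variance case up to $H_X$, and the result is clearly invariant.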
At last we justify the case when $A^*$ is not unique, which means there exists $A^*$ that is spanned
by the first $d$ coordinate axes under different transformation matrices. Because the conditions in
Definition 2 are required to be satisfied for all $A^*$, in most cases the CCA optimality is still
equivalent to Inequality (6), i.e., CCA is optimal if and only if Inequality (6) is satisfied for at
least one $A^*$ after proper transformations for each $A^*$. The same holds for the GCCA optimality
(Inequality (7)), and we can still conclude that GCCA improves CCA following the same steps. However, a
special case should be taken into consideration, and we take the CCA projection $A_2(X, Z_1)$ for an
illustration: suppose the singular vector that $\sigma_1(E(W W_s'))$ corresponds to is the $(d+1)$th
coordinate axis and $\sigma_1(E(W W_s')) > \sigma_2(E(W W_s'))$. Then $A_2(X, Z_1)$ can be chosen to
represent any dimension $d$ subspace of the space spanned by the first $(d+1)$ coordinate axes, and the
degrees of freedom is $(d+1)d - \frac{d(d+1)}{2}$ (the degrees of freedom may increase if there are
repeating singular values). Now, if $A^*$ happens to have the same degrees of freedom in the space
spanned by the first $(d+1)$ coordinate axes, then $A_2(X, Z_1)$ is optimal if and only if $h_1 \ge 0$
(rather than $h_1 > 0$) because any arbitrary choice of $A_2$ is optimal. A similar phenomenon applies
for $A_{s+1}$, in which case Inequality (6) and Inequality (7) should be adjusted to include equalities.
However, in this case we still have $h + h_1 + h_2 \ge 0$ when the CCA projections are optimal, which is
still sufficient (but may not be necessary) for GCCA to be optimal. Therefore, GCCA still improves CCA
in case of non-unique $A^*$, and the justification is done. $\square$

5.2. Proof of Theorem 1 for any $K \ge 2$ and $r \ge 1$

Proof. Now we generalize the result to arbitrary $K \ge 2$ (multi-class) and any $r \ge 1$ (the GCCA
criterion). Without loss of generality, we assume $A^*$ is unique and $H_X$, $H_{Z_s}$, $\Sigma_{\tilde{Z}_s}$ are all identity
matrices.

Using the same setting and arguing as before, GCCA improves CCA if and
only if

\[
h = \sum_{k=1}^{K} p_k q_{1k} q_{2k} - \sigma_1(E(W_1 W_2')) > 0 \qquad (8)
\]

is true when $h_s = \sum_{k=1}^{K} p_k q_{sk} - \sigma_1(E(WW_s')) > 0$ for $s = 1, 2$.
This is true because 



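\[
\begin{aligned}
h &= \sum_{k=1}^{K} p_k q_{1k} q_{2k} - \sigma_1(E(W_1 W_2')) \\
  &\ge \sum_{k=1}^{K} p_k q_{1k} q_{2k} - \sigma_1(E(WW_1')) \, \sigma_1(E(WW_2')) \\
  &\ge \sum_{k=1}^{K} p_k q_{1k} q_{2k} - \Big( \sum_{k=1}^{K} p_k q_{1k} \Big) \Big( \sum_{k=1}^{K} p_k q_{2k} \Big) \\
  &\ge \sum_{1 \le k_1 < k_2 \le K} p_{k_1} p_{k_2} (q_{1k_1} - q_{1k_2})(q_{2k_1} - q_{2k_2}) \\
  &> 0,
\end{aligned}
\]

where the inequalities again follow from conditions (3) and (4) and simple algebra.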
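
The "simple algebra" step can be illustrated numerically. For weights $p_k$ that sum to one, the quantity $\sum_k p_k a_k b_k - (\sum_k p_k a_k)(\sum_k p_k b_k)$ equals the pairwise sum $\sum_{k_1 < k_2} p_{k_1} p_{k_2} (a_{k_1} - a_{k_2})(b_{k_1} - b_{k_2})$. The sketch below checks this identity on random data. The values of `q1` and `q2` are hypothetical, and sorting them in the same order is our own stand-in for a condition under which every pairwise term is nonnegative; it is not a restatement of conditions (3) and (4).

```python
import itertools

import numpy as np


def weighted_covariance(p, a, b):
    """sum_k p_k a_k b_k - (sum_k p_k a_k)(sum_k p_k b_k)."""
    return np.sum(p * a * b) - np.sum(p * a) * np.sum(p * b)


def pairwise_form(p, a, b):
    """Sum over k1 < k2 of p_k1 p_k2 (a_k1 - a_k2)(b_k1 - b_k2)."""
    return sum(
        p[i] * p[j] * (a[i] - a[j]) * (b[i] - b[j])
        for i, j in itertools.combinations(range(len(p)), 2)
    )


rng = np.random.default_rng(1)
K = 5
p = rng.dirichlet(np.ones(K))      # class priors, summing to 1
q1 = np.sort(rng.uniform(size=K))  # hypothetical q_{1k}
q2 = np.sort(rng.uniform(size=K))  # hypothetical q_{2k}, ordered like q1

lhs = weighted_covariance(p, q1, q2)
rhs = pairwise_form(p, q1, q2)
print(lhs, rhs)
```

Because `q1` and `q2` are sorted the same way, each factor pair $(q_{1k_1} - q_{1k_2})(q_{2k_1} - q_{2k_2})$ has matching signs, so the pairwise sum cannot be negative.
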
As to the GCCA criterion with r > 1, GCCA improves CCA if and only if 



\[
\Big( \sum_{k=1}^{K} p_k q_{1k} q_{2k} \Big)^r - \sigma_1^r(E(W_1 W_2')) > 0
\]

is true when $h_s > 0$. Clearly this inequality holds if and only if it holds for $r = 1$, which is
Inequality (8). Hence it is true and GCCA improves CCA in the similar family for any $r > 1$.
Thus Theorem 1 is proved for any number of classes and any GCCA criterion with $r \ge 1$. $\square$
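
One way to see the reduction to $r = 1$ is sketched below. It assumes (our assumption, not stated above) that $x = \sum_k p_k q_{1k} q_{2k}$ is nonnegative; $y = \sigma_1(E(W_1 W_2'))$ is a singular value and hence nonnegative.

```latex
% For x, y >= 0 and r >= 1, the map t -> t^r is strictly increasing on [0, \infty), so
\[
x^r - y^r > 0 \iff x^r > y^r \iff x > y \iff x - y > 0 .
\]
```
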

5.3. Proof of Corollary 1 and Corollary 2

Proof. Without loss of generality, we carry out the proof assuming $A^*$ is unique, $H_X$, $H_{Z_s}$, $\Sigma_{\tilde{Z}_s}$
are all identity matrices, and $K = 2$ and $r = 1$.



There are $S$ auxiliary features in total, and thus $\binom{S}{S'}$ choices of auxiliary features for $A_{S'+1}$.



We define $h_s = p q_{s1} + (1 - p) q_{s2} - \sigma_1(E(WW_s'))$ and $h_{st} = p q_{s1} q_{t1} + (1 - p) q_{s2} q_{t2} - \sigma_1(E(W_s W_t'))$
for any $s$ and $t$ satisfying $S \ge s, t \ge 1$, where $h_{st}$ is a generalization of $h$ in the proof of Theorem 1.
Then the GCCA projection $A_{S'+1}$ using the first $S'$ auxiliary features is optimal if and only if

\[
\sum_{1 \le s < t \le S'} h_{st} + \sum_{s=1}^{S'} h_s > 0. \qquad (9)
\]



This is a generalization of Inequality (7), because there are $S'$ possible "singular value losses" caused
by $\Sigma_{XZ_s}$ and $\frac{S'(S'-1)}{2}$ additional cross-covariance terms $\Sigma_{Z_s Z_t}$ between the auxiliary features.
Note that for any other $A_{S'+1} \in \mathcal{A}_{S'+1}$ with a different choice of auxiliary features, we can still
use Inequality (9) for the optimality by switching the first $S'$ auxiliary features with the chosen $S'$
auxiliary features.
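
As an illustration of how Inequality (9) is evaluated for different choices of auxiliary features, the sketch below sums the pairwise terms $h_{st}$ and the single terms $h_s$ over a chosen index set. The function name `gcca_criterion` and all numerical values are hypothetical and chosen only for the demonstration; indices are 0-based in the code.

```python
import itertools

import numpy as np


def gcca_criterion(h_pair, h_single, chosen):
    """Left-hand side of Inequality (9) for the chosen auxiliary features.

    h_pair[s, t] holds h_st and h_single[s] holds h_s (0-based indices).
    """
    pair_sum = sum(
        h_pair[s, t] for s, t in itertools.combinations(sorted(chosen), 2)
    )
    single_sum = sum(h_single[s] for s in chosen)
    return pair_sum + single_sum


# Hypothetical values for S = 3 auxiliary features.
h_single = np.array([0.2, 0.1, -0.05])
h_pair = np.array([
    [0.0, 0.03, 0.01],
    [0.03, 0.0, 0.02],
    [0.01, 0.02, 0.0],
])

# Evaluate the criterion for every choice of two or three auxiliary features.
for size in (2, 3):
    for chosen in itertools.combinations(range(3), size):
        value = gcca_criterion(h_pair, h_single, chosen)
        print(chosen, value, value > 0)
```

Switching to a different set of auxiliary features only changes the index set passed as `chosen`, which mirrors the relabeling argument in the text.
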

All the CCA projections are optimal if and only if $h_s > 0$ for all $s = 1, \ldots, S$. This implies
that $h_{st} > 0$ is always true for any $1 \le s < t \le S$, and Inequality (9) holds for any $A_{S'+1} \in \mathcal{A}_{S'+1}$
with $S \ge S' \ge 2$. Therefore the set of GCCA projections $\mathcal{A}_{S'+1}$ always improves the set of CCA
projections $\mathcal{A}_2$, and Corollary 1 is proved.

To prove Corollary 2 we use the simplifying condition (4*). Then Inequality (9) simplifies to
$\frac{S'-1}{2} h_{12} + h_1 > 0$, because $h_{st}$ are the same for all $1 \le s < t \le S$ and so are $h_s$. We need to show
that if $A_{S'+1}$ is optimal for a certain $F_{S+1}$, so is $A_{S'+2}$. (Note that the choice of auxiliary features no
longer matters because they follow the same distribution, which means all the elements in $\mathcal{A}_{S'+1}$
represent the same subspace.)

When $S = 2$, it is a special case of Theorem 1 because any $F_{S+1}$ satisfying condition (4*) also
satisfies condition (4). Clearly $A_2$ is optimal if and only if $h_1 = h_2 > 0$, which implies $h_{12} > 0$. So
Inequality (9) holds and $A_3$ is also optimal.

When $S = 3$, $A_3$ is optimal if and only if $\frac{1}{2} h_{12} + h_1 > 0$. In this case if $h_1 > 0$, then we
have $h_{12} > 0$; if $h_1 \le 0$, then $h_{12} > 0$ must be true in order for $A_3$ to be optimal. In any case,
$h_{12} + h_1 > 0$ is true and $A_4$ is optimal.

Therefore, the optimality of $A_3$ implies the optimality of $A_4$. By induction, for any $S > S' \ge 2$,
the optimality of $A_{S'+1}$ implies the optimality of $A_{S'+2}$ under the simplifying condition (4*).
Corollary 2 is proved. (Unfortunately this is not true under the original condition (4), and one can
easily make up a counter-example by checking Inequality (9).) $\square$
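
The induction step can be spelled out as follows. This is a sketch under the reading that, with condition (4*), Inequality (9) for $A_{S'+1}$ reduces to $\frac{S'-1}{2} h_{12} + h_1 > 0$, and that $h_1 > 0$ implies $h_{12} > 0$ (the $S = 2$ case); both facts are taken from the argument above rather than proved here.

```latex
% Assume (S'-1)/2 * h_{12} + h_1 > 0 for some S' >= 2.
% Case h_1 > 0:   h_{12} > 0 by the S = 2 case.
% Case h_1 <= 0:  (S'-1)/2 * h_{12} > -h_1 >= 0, hence h_{12} > 0.
% In both cases h_{12} > 0, so
\[
\frac{S'}{2} h_{12} + h_1
  = \Big( \frac{S'-1}{2} h_{12} + h_1 \Big) + \frac{1}{2} h_{12}
  > 0 ,
\]
% which is the simplified form of Inequality (9) for A_{S'+2}.
```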